Global, Exact Cosmic Microwave Background Data Analysis Using Gibbs Sampling 



CO 

o 
o 

CN 
-t— > 
O 

O 

in 

CN 

> 
o 

oo 
O 
O 

CO 

o 

^: 

Or 

6 ■ 



X 



Benjamin D. Wandelt, 1,2 'Q David L. Larson, 1 and Aran Lakshminarayanan 1 

1 Department of Physics, UIUC, 1110 W Green Street, Urbana, IL 61801 
2 Department of Astronomy, UIUC, 1002 W Green Street, Urbana, IL 61801 
(Dated: February 2, 2008) 

We describe an efficient and exact method that enables global Bayesian analysis of cosmic mi- 
crowave background (CMB) data. The method reveals the joint posterior density (or likelihood for 
flat priors) of the power spectrum Ci and the CMB signal. Foregrounds and instrumental param- 
eters can be simultaneously inferred from the data. The method allows the specification of a wide 
range of foreground priors. We explicitly show how to propagate the non-Gaussian dependency 
structure of the Ct posterior through to the posterior density of the parameters. If desired, the 
analysis can be coupled to theoretical (cosmological) priors and can yield the posterior density of 
cosmological parameter estimates directly from the time-ordered data. The method does not hinge 
on special assumptions about the survey geometry or noise properties, etc. It is based on a Monte 
Carlo approach and hence parallelizes trivially. No trace or determinant evaluations are necessary. 
The feasibility of this approach rests on the ability to solve the systems of linear equations which 
arise. These are of the same size and computational complexity as the map-making equations. We 
describe a pre-conditioned conjugate gradient technique that solves this problem and demonstrate 
in a numerical example that the computational time required for each Monte Carlo sample scales 
as Up^ 2 with the number of pixels n p . We test our method using the COBE-DMR data and explore 
the non-Gaussian joint posterior density of the COBE-DMR Ce in several projections. 



I. INTRODUCTION 



The observation and analysis of cosmic microwave 
background (CMB) anisotropics have attracted a great 
deal of attention in recent years due to their unique rel- 
evance for cosmological theory (see p| for a recent re- 
view). A slew of observational results have been pub- 
lished during the last two years 0. These were obtained 
from maps of the microwave sky at ever increasing sensi- 
tivity and resolution. Since the recent release of the first 
year WMAP data, an all-sky microwave survey has been 
available down to angular scales of 12 minutes of arc 
By the end of the decade the Planck satellite is expected 
to generate 1 Terabyte of high resolution, high sensitivity 
all-sky data. 

The basic assumption is that the CMB anisotropy sig- 
nal and the instrumental noise are Gaussian and that 
the signal statistics are isotropic on the sky. Con- 
tact between theory and observation is then best made 
by extracting the angular power spectrum Ci from the 
data0,0,0. Methods for efficiently estimating the power 
spectrum have been investigated since the computational 
unfcasiblity of using the brute-force approach was re- 
alized 0, @- This effort has yielded two classes of 
methods: exact methods, applicable only to two nar- 
rowly defined classes of observational strategies 0, ^| , 
and approx imate but more broadly applicable methods 
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We will describe here a solution to the problem of in- 
ference from microwave background data which combines 
the advantages of exact methods with the practicality of 
the approximate methods. The computational cost of our 
method scales like the best approximate method for the 
same experiment, albeit with a larger pre-factor. Power 
spectrum estimates and any desired characterization of 
the (multivariate) statistical uncertainty in the estimates 
can be computed free from any approximations in the es- 
timator which could lead to sub-optimality or biases. 

The solution we propose is to sample the power spec- 
trum (as well as other desired quantities, such as the 
underlying CMB signal, foregrounds or the noise proper- 
ties of the instrument) directly from the joint likelihood 
(or posterior) density given the data. We can efficiently 
sample from this multi-million dimensional density us- 
ing the Gibbs sampler. This approach obviates the need 
to evaluate the likelihood or its derivatives in order to 
analyze CMB data. 

Our approach shares certain algorithmic features with 
the approach independently discovered in which de- 
scribes a maximum likelihood estimator of the power 
spectrum using Bayesian Monte Carlo methods. How- 
ever, our goal from the outset was to design a method 
that allows a full exploration of the multivariate proba- 
bility density of the power spectrum and the parameter 
estimates, given the data. 

Our method seamlessly integrates with parameter es- 
timation without recourse to semi-analytic Gaussian, off- 
set log- normal x 2 or hybrid [2(| approximation 
schemes. If desired, theoretical priors can be applied in 
the analysis by restricting the space of power spectra 
to those which arise from a physical model of the CMB 
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anisotropy. 

By design, the sample of power spectra and recon- 
structed sky maps will reflect the statistical uncertainty 
given the data through the full non-Gaussian statistical 
dependence structure of the Cg estimates. This infor- 
mation can be propagated losslessly to the cosmological 
parameter estimates. 

One aspect of our method which is of general interest 
in astrophysics beyond CMB analysis is that it general- 
izes the results on globally optimal interpolation, filtering 
and reconstruction of noisy and censored data sets in |25j 
to self-consistently include inference of the signal covari- 
ance structure. This defines a generalized Wiener filter 
that does not need a priori specification of the signal co- 
variance. A byproduct of our method is a prescription 
for "unbiasing" the Wiener filter which clearly reveals the 
tight relation between Wiener filtering and power spec- 
trum estimation. 

Our methods differ from traditional methods of CMB 
analysis in a fundamental aspect. Traditional methods 
consider the analysis task as a set of steps, each of which 
arrives at intermediate outputs which are then fed as 
inputs to the next step in the pipeline. Our approach is 
a truly global analysis, in the sense that the statistics of 
all the science products are computed jointly, respecting 
and exploiting the full statistical dependence structure 
between the various components. 

In summary, our method is a Monte Carlo technique 
which samples power spectra and other science products 
from their exact, multivariate a posteriori probability 
density and which does so without explicitly evaluat- 
ing it. The result is a detailed characterization of the 
statistics of the CMB signal on the sky, reconstructed 
foregrounds, the CMB power spectrum, and the cosmo- 
logical parameters inferred from it with a cost which is 
proportional to the cost of a least squares map-making 
algorithm for the same set of observations. 

In section ITT1 we introduce notation and a general sta- 
tistical model of CMB observations. Our method is de- 
scribed in detail in section IIIII In section IIVI we com- 
ment on the perspective our Bayesian approach offers 
on cosmic variance. We discuss the numerical and com- 
putational techniques used to implement our method in 
section |V| and apply it to the COBE-DMR data m sec- 
tion lVIl We reflect on where we stand and conclude with 
comments on further work to be done in section rVlII 



II. MODEL AND NOTATION 

We begin by defining our model of CMB observations 
and introduce our notation. We imagine that the actual 
CMB sky s is observed with some optical system and ac- 
cording to some observing strategy encoded in a pointing 
matrix A, which maps the signal on the sky into a col- 
lection of n a time-ordered observations of the sky. This 
results in the "raw" data d, represented by a vector with 
n elements (an n -vector). Our model of this process is 



encoded in the model equation 

d = A(s + f)+n 



lad 



(i) 



where n is a realization of Gaussian instrumental noise 
added to the data and / = J2i fi is the sum of a collection 
of foregrounds (assumed spatially varying and constant 
in time). We represent maps on the sky with n v resolu- 
tion elements (pixels) cLS Tip vectors. Note that while we 
do not explicitly consider multi-channel data, the model 
is easily generalized to that case by adding a frequency 



index to d, A, 



-.tod 



and /. 



The "map" vector m is the least squares estimate of 
the signal s + f from d. Because we assume Gaussian 
noise with zero mean this is also the maximum likelihood 
estimate (or maximum a posteriori estimate assuming a 
flat prior). It can be found as the solution of the normal 
equation 



^N-lAm = A T N-\d. 



(2) 



Here the matrix Nt nr i is the covariance matrix of the noise 



in the time ordered data space N to d = (n 



tod„tod T 



). Then 



m = s + f + n where n describes the residual noise on 
the map estimate with covariance matrix N = (nn T ) — 
(A T N-]A)-\ 

The cosmological model specifies the signal covariance 
matrix S. For isotropic theories S is diagonal in the 
spherical harmonic basis, with the special form Se m e>m' = 

In keeping with the majority of the literature in the 
field, we restrict our discussion to theories which predict 
a Gaussian CMB signal s. It will be convenient to ab- 
breviate Gaussian multivariate densities as 



G{m,C) 



1 



exp ( —m T C 



(3) 



III. METHOD 

A. Overview 

For a cleaner exposition of the method, we will ignore 
the foregrounds / for now and return to their inclusion 
later. We are trying to explore the a posteriori density 



P(Cg\m) cx G(m, S{C e ) + N)P(C e ) 



(4) 



where P{Cg) is the density encoding prior information 
on the Cg. Up to normalization this can be obtained by 
marginalizing the joint density 



P(C e ,s,m) = P(m\s)P(s\Cg)P{Cg) 



(5) 



over the signal s. Setting P(Cg) — const makes this anal- 
ysis equivalent to an exact frequentist likelihood analysis. 
We will discuss other choices of prior later. 

Traditionally, the approach to exploring the a poste- 
riori density has been to define an estimator, such as 
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the least squares quadratic (LSQ) estimator j5| or the 
maximum likelihood (ML) estimator [jj. Then some 
measure of uncertainty in the values of this estimator 
was defined, for instance by approximating the shape of 
P(Ce\m) around the maximum by a multivariate Gaus- 
sian and evaluating elements of the curvature matrix at 
the extremum. 

Evaluating the LSQ or ML estimators is a very costly 
operation, taking 0(n.p) operations^- In general, eval- 
uating the curvature matrix is even more costly because 
it has 0{n p ) elements each of which requires 0(nZ) opera- 
tions, making the overall operation count of order 0{n^). 
In addition a Gaussian approximation fails at low I where 
the small number of degrees of freedom makes the poste- 
rior significantly non-Gaussian, and also at high t in the 
regime of small signal-to-noise (S/N < 1). 

Instead, we propose to sample parameter values Cg 
from the posterior directly. There is no known way to 
directly sample from Eq0] but if a way can be found to 
sample s and Ce from the joint distribution Eq. JSJ then 
the Cg taken by themselves are exact samples from the 
marginalized distribution. 

At first, sampling from the joint distribution seems 
even less feasible. But powerful theorems can be proved 
[2^| that show that if it is possible to sample from the 
conditional distributions P(s\Ce,m) and P(Ci\s,m) oc 
P(Ci\s) then one can sample from the joint distribution 
in an iterative fashion. Begin with some starting guess 
C°. Then iterate the following equations 

s i+1 ^P{s\C\,m) (6) 

C\ +1 «- P{C t \s l+1 ) (7) 

then after some "burn- in" the (C\,s l ) converge to being 
samples from the joint distribution Eq. (J3J). This tech- 
nique of sampling from the joint distribution is called the 
Gibbs sampler. 

B. Sampling Techniques 

To implement these ideas one needs the forms of the 
conditional densities and recipes for sampling from these 
distributions. These follow now. 

The conditional density of the signal given the most 
recent Ci sample is just a multivariate Gaussian 

P{s\C\, m) oc G {S\S l + N)~ l m, ((ST 1 + A" 1 )- 1 ) , 

( g ) 

where S l = S(C\). This will be recognized as the pos- 
terior density of the Wiener Filter given the most recent 
power spectrum estimate. 

The density for the power spectrum coefficients Cg fac- 
torizes due to the special form of S. 

P(C,| S i )ocP(C,)n^=expf--i- jr \sU 2 ) 

(9) 



The s\ m are the spherical harmonic coefficients of s 1 . 
This density is known as the inverse Gamma distribution 
of order 21 — 1 . This result has interesting implications 
for cosmic variance in this Bayesian framework, which we 
will discuss below. 

To sample from Eq. © we need to generate a Gaussian 
variate with the given mean and covariance. A numer- 
ically convenient choice (see section [Vj of the equation 
for the mean x is 

(1 + S l ^N^S 1 ?)S l = S l IN' 1 ™. (10) 

In fact it is easier to solve for z = S l ~ix and to then 
solve for x trivially. Note from its definition above that 
N _1 x is easier to compute than Ax. If N to d is circulant 
(stationary noise) or block-circulant (a popular choice 
for non-stationary noise), Nf od x can be computed using 
FFTs. If A is diagonal to very good accuracy then com- 
puting N^ 1 easy. We will drop the i superscript in what 
follows. 

We chose to write the equation in terms of the map 
made from the data. It easy to see from Eq. J5J and from 
N = {A T N^ o 1 d A)~ 1 that writing Eq. |JTD) in terms of the 
TOD saves some computations: iV _1 m = A T Nzld. This 
replacement can be made throughout in the equations 
that follow in the remainder of this paper. 

Then we need to add a fluctuation term to this mean 
to get a random variate. This is non-trivial, because we 
need to simulate a map with covariance (S^ 1 + TV -1 ) -1 
without being able to compute square roots of this ma- 
trix. We can, however, compute the square root of S 
because it is diagonal in spherical harmonic space and 

we can compute = A T N t 2 od by using FFTs on the 

time-ordered data. This leads to the following solution: 
generate two p-vectors £ and x of independent Gaussian 
random variates, with zero mean and unit variance (these 
are called normal variates) . Then solve the linear set of 
equations 

(l + S^N~ 1 S^S-^y = £ + siN-ix (H) 

for y. It is easy to verify that this does give the right 
covariance by computing (yy T ). The final result is then 

s i+1 = x + y, (12) 

where we have re- introduced the superscript. 

Note that each s is a perfect pure signal sky (up to the 
assumed band-limit) with covariance S. While x is the 
Wiener filter, whose power spectrum would be a biased 
estimator of the Cg, s is "unbiased". The addition of the 
fluctuating term y has replaced filtered noise fluctuations 
with synthetic signal fluctuations. 

Drawing the C» +1 from the inverse Gamma distri- 
bution, Eq. @, is very simple. For each i, compute 

°t ~ Em=-< \ s \m\ 2 an d generate a {21 — l)-vector pi of 
Gaussian random variates with zero mean and unit vari- 
ance. Then 
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where the denominator is the square norm of pg. 



C. Foregrounds 

Traditionally, regions on the sky are excised if the 
residual error after foreground subtraction is large. How- 
ever, modeling the signal on the remainder of the sky af- 
ter foreground cuts complicates the structure of the signal 
covariance matrix S. Instead, we choose to model fore- 
grounds as an additional component in the model equa- 
tion, as shown in Eq. (JTJ. Then the joint density in Eq. 
(0) becomes 

P(Ct, s, {/,}, d) = P(d\s, {f 3 })P(s\C e )P(C e ) J] P(f k ) 

k 

(14) 

where each P(fk) contains prior information about the 
fcth foreground. 

Following the Gibbs sampler approach we draw from 
the foreground components given the data. We group 
different logically separate foregrounds by adding in ad- 
ditional steps in the sampling chain 

for every j : «- P(f 3 \C l e , s\ {f k<3 } l+ \ {f k>3 } 1 ) 

«- P(s\Cj,{f j } i+1 ) (15) 
c; +1 «- P(C e \s i+1 ) 

Where appropriate, different foreground components 
may be grouped together into one fj. The algorithm 
to sample from the conditional foreground densities is 
analogous to the signal sampling algorithm described in 
the previous subsection. We will return to algorithmic 
issues after discussing the foreground prior . P(f 3 ). 

How do we specify the foreground prior? For in- 
stance, we might want to be completely insensitive to 
certain foreground terms {fi}- This would mean setting 
P(f) = G(f, FF T ), where FF T = a) fjf and each 
vector fi represents a foreground contribution we would 
like to project out. The matrix F is just constructed 
by columns from the fi. The variance is numerically 
"infinite" , i.e. large compared to any other noise source. 
This specifies maximal ignorance about the amplitude of 
this foreground component. As an example, /, could be 
the monopole and the three dipole components. Or, if 
the foreground contribution in a pixel j was completely 
unknown, the fi = lj where lj is the vector represent- 
ing the map which is zero everywhere except in the pixel 
j. Essentially any spatial template to which we want 
the power spectrum estimation to be insensitive can be 
added in here, and they can be grouped in computation- 
ally convenient ways in Eq. H15(l . 

It is important to note that even though we may have 
specified "infinite" variance in our prior, the foreground 
samples produced will be constrained by the data and 
hence will be informative about the level of the fore- 
ground contribution supported by the data. For example, 



the sample of the three dipole components generated dur- 
ing the iteration of Eq. l|15f) in the example above would 
be informative about the direction and amplitude of the 
CMB dipole, and could be used to calibrate the experi- 
ment. 

Different choices for the foreground prior P{f) are pos- 
sible. It could include information on foreground tem- 
plates as well as a specification of our uncertainty in these 
templates. For example if the template is / and our un- 
certainty could be described by a Gaussian centered on / 
with covariance FF T then P(f) = G{f - /, FF T ). One 
way to specify / and FF T would be to simulate a set 
of possible theoretical foreground models fi with weights 
Wi, such that J^tWj = 1, and to then set / = E Wifi and 

FF T = T. i m(fi-fM-I) T . 

Note that the assumption of a Gaussian prior P(f) 
only assumes that our ignorance of the foreground con- 
tribution can be expressed through a Gaussian covari- 
ance structure — the foregrounds are not assumed to have 
Gaussian statistics. Non-Gaussianity can be explicitly 
assumed by choosing a non-Gaussian template /. For 
the case of multi-frequency data, P(f) could encode what 
is known about the dependence of certain physical fore- 
ground components on the frequency. 

Returning to the mechanics of sampling Eq. 1151) we 
write J-j = FjFj , and solve at the jth step 

(Fi+FjN A r : f)x 3 = Ji+FiN-^m s-^ fi)> ( 16 ) 
and 

Then fj = Tj(xj + yj). Since T 3 may not be full rank 
in n p dimensions, the equations here may be understood 
as shorthand for the projected equations in the subspace 
on which T 3 has full rank. 

Note that when foregrounds are considered, the m on 
the right hand side of Eq. (jlOJl has to be replaced with 
(m Ej fi)- 

In special cases it may be desirable to perform the 
marginalization over / analytically. Appendix ^ gives 
techniques for doing so. 

D. Noise model 

It is straightforward to extend our methods to include 
estimation of the noise covariance from the data them- 
selves. In the case that N to d is non-stationary and block- 
diagonal with circulant blocks (the standard assumption 
in CMB analysis), we can easily add a sampling step 
symbolically written as 

N} +1 P(N\{N k ^y, s, C t , {/,•}) = P(N\s, {/,•}). 

(18) 

The second equality expresses two facts: (1) for a block 
diagonal noise matrix the conditional density of one block 
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does not depend on the other blocks and (2) N is con- 
ditionally independent of the Cg given s; that is given s 
the Ci do not add more information about the N. 

In practice, the noise model would assume smoothness 
of the noise power spectra. If we write Njk for the kth 
band power spectral coefficient of the jth block of the 
noise covariance of the TOD simply involves computing 
the FFT of the jth segment of d — A(s + ^2f), adding 
the power in bands of width d and then sampling Njk 
from the inverse Gamma distributions of order d — 2. 

More general noise models can be implemented. We 
will explore the effect of more sophisticated modeling of 
non-stationary noise in future work. 

E. Parameter estimation 

Currently power spectrum estimation algorithms rely 
on approximate representations of the posterior density 
P(Cg\d) [23, for example in terms of multivariate Gaus- 
sian, shifted log-normal or hybrid representations. These 
approximations have to be fitted to sets of Monte Carlo 
simulations (2£j ■ Since they take simple analytical forms 
they can only be expected to be accurate near the peak 
of the posterior density. In order to faithfully propagate 
all the information in the C% estimates through to the 
parameter estimation step, efficient ways must be found 
to accurately represent and communicate P(Ce\d). 

The Bayesian estimation technique described in this 
paper provides a natural answer to this problem. The 
method generates a set of samples from P(Cg\d) which 
can simply be published electronically. Meaningful sum- 
maries of the properties of P(Ci\d) can all be calculated 
arbitrarily exactly, given a sufficient number of samples. 

The disadvantage of using this sample set for parame- 
ter estimation is that it does not lend itself easily to com- 
puting a numerical probability density for a theoretical 
Ci power spectrum computed from a set of cosmological 
parameters 9. 

However, a fortunate circumstance solves the problem 
of finding an arbitrarily exact numerical representation 
of P{Ci\d). At each iteration of the Gibbs sampler the Ci 
are drawn from P(Ct\s) which is in fact P{Ce\o~i) where 
°i = Em l s ^m| 2 - We can therefore write 

P{Ct\d) = J dsP{C e ,s\d)= J dsP{C e \s)P(s\d) 

= f Da t P{CMt)P^t\d) « — Vp(QK). 

J U G i 

(19) 

The sum (where the index runs over no Gibbs samples) 
becomes an arbitrarily exact approximation to the in- 
tegral as the number of samples increases. It is called 
the Blackwell-Rao estimator for the density and can be 
shown to be superior to binned representations. This sum 
yields a numerical representation of the posterior density 
of the power spectrum given the signal samples. All the 



information about P(Ci\d) is contained in the a}, which 
generate a data set of size 0{l max nQ). 

It is noteworthy that in the limit of perfect data, using 
Eq. (|19|l returns the exact posterior density after only 
one iteration of the Gibbs sampling algorithm. 

In addition to being a faithful representation of 
P(Ct\d) it is also a computationally efficient representa- 
tion. Evaluating the Gaussian or the shifted log-normal 
approximations to P(Ce\d) takes 0{t z max ) operations, 
while our approach requires only O(l max no) operations. 
Note also that any moments of P(Ce\d) can be calculated 
through 

(CiC e > . . ■Ce>)\ P (c t \d) ~ — V^CfCV ■ ■ • CV<)| P(C , 

This is a far more efficient representation than would be 
afforded by a Monte Carlo sample of a pseudo-C^ esti- 
mator since each of the terms on the right hand side can 
be computed analytically. 

Another feature of this framework is that is possi- 
ble to include cosmological parameter estimation in the 
joint analysis of the data. If we assume a class of the- 
oretical models, we can solve the estimation problem 
of power spectrum and cosmological parameters concur- 
rently. The assumption of such a class of models which 
amounts to choosing a prior for the power spectra which 
excludes spectra that could not possibly be the result 
from a solution of the Boltzmann equation for any com- 
bination of the parameters about which we wish to make 
inferences. 

With such an assumed class of models the relation- 
ship between Ct and the cosmological parameters 9 is a 
non-stochastic one, Q = Ci{Q), and P{Cg\9) is a delta 
function. We can integrate out this delta function in the 
posterior and then obtain the conditional density for sam- 
pling the cosmological parameters given the data. This 
procedure results in removal of the Ce sampling step and 
the addition of the following step to the list in Eq. I|15fl : 

9 l+1 <- P(C e (0)\s i+1 ). (21) 

Here P(Ci(9)\s l+1 ) denotes the inverse Gamma distribu- 
tion, Eq. ©, and Cg(9) is defined through cosmological 
theory. Instead of sampling from the £ m ax power spec- 
trum coefficients given the o~e we sample from 9 assuming 
that we just measured the o~g on a perfect signal sky (the 
last draw). In practice, that can be achieved by running 
a Markov Chain using the Metropolis Hastings algorithm 
until one independent 9 sample is produced. 

If we believe strongly in the theoretical framework, 
using this prior information is desirable: it reduces the 
number of parameters in the problem and therefore im- 
proves the signal and hence also the foreground recon- 
struction from the data. The set of Ci for the draws of 9 
represents stochastically what is known about the theo- 
retical power spectrum. This method defines an optimal 
non-linear filter which returns the best power spectrum 
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FIG. 1: Computing time averaged over 30 iterations of the 
Gibbs sampler required for solving Eq. (1101 and Eq. il l 1 1 as a 
function of the number of pixels in the map. These timings are 
for a single AthlonXP 1800+ CPU. Solid line: actual timings. 
Dashed lines show rip for x £ {3, 5/2, 2, 3/2} from the top to 
the bottom on the right side of the figure. 



and a characterization of the error while including phys- 
ical constraints on the analysis (for example the smooth- 
ness of the Ci which is related to the natural frequency 
of oscillations modes in the primordial plasma). 

However, just as we are interested in making maps 
from the data without inputting information about the 
foregrounds and the statistical properties (e.g. isotropy) 
of the CMB, we are also interested in the model indepen- 
dent power spectrum constraints. 



IV. COSMIC VARIANCE 

In Eq. JHJ we have written down the conditional pos- 
terior P(Ce\s). This encodes what we know about the 
Ce if we have perfect knowledge of the signal on the sky. 
The full posterior distribution P(Ce\d) would reduce to 
this if we had perfect (noiseless, all-sky) data. 

As shown in Eq. @ the Ce only depend on the data 



through a e = YX1= 



-e \°lm\ 



These oe have a physical in- 



terpretation. They measure the actual fluctuation power 
on our sky. Therefore, if we had perfect data it would be 
possible to measure the 07 with zero variance. 

The residual uncertainty in Ce even for perfect data 
is a well known fact, known as cosmic variance. It is 
the consequence of having only one sky at our disposal, 
which means that there are a limited number of degrees 
of freedom for each Cg. Hence we cannot measure the Ce 
arbitrarily precisely. 

In this Bayesian treatment the functional form of the 
conditional posterior density may be unexpected. In the 
frcqucntist approach where the true underlying theory 
(i.e. the Ce) is thought of as fixed and the data as random, 



FIG. 2: The COBE-DMR power spectrum. The vertical 
bands display the marginalized densities at each £. Horizon- 
tal bars mark off bins of constant probability. These bins are 
assigned their color in Ce space and then projected into the di- 
agram. The bin with the highest probability density is shown 
in black. The dark and light shaded regions are the 1-cr and 
2-a highest posterior density regions, respectively. The Ce are 
measured in units mK 2 in this and all subsequent figures. 



are thought of 



the variances on our sky og = ^ r 
as x 2 variates with 21+1 degrees of freedom. 

From a Bayesian perspective the data is fixed and our 
knowledge of the underlying theory is uncertain — so our 
knowledge about the Ce is encoded in the inverse Gamma 
distributions (2£- 1), Eq. @. 

The mean and variance of the inverse Gamma distri- 
bution of order d are 



and 



(Ce) 



(AC, 2 ) = 



01 
d-2 



2 a 



d>2 7 



(d-4) (d-2)" 



d>4. 



(22) 



(23) 



For the case of a flat prior P(Ce) — const we obtain 
d = 21 — 1 . In this case the variance only becomes finite 
for t > 2. This is a result of having chosen a flat prior 
for a variance measurement. There are in fact no argu- 
ments for doing so — when measuring a variance (which 
is a scale parameter) a flat prior does not correspond to 
maximal ignorance. 

The Jeffrey ignorance prior [23] for this case is P(Ce) = 
l/C(. This would lead to d — 21+1 and finite vari- 
ance for t > 1. In this case {Ce) — ^frr an d i^-Cj) — 

(2^-3) i2e~\f ' 

In order to obtain the frequentist expectation (Ce) — 
gffj the prior P(Ce) = 1/Cj would have to be used. In 
this case we still obtain a variance for the estimator which 
is larger by a factor ||±i than the frequentist chi-square 
variance [2l]]. So the mean of P(Ce\d) is an unbiased 
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estimator of Ci for perfect data and hence has the same 
expectation as the maximum likelihood estimator 17]. 

These considerations are potentially relevant to the 
discussion about the statistical significance of the low 
I Ci estimates in the WMAP data in the Bayesian ap- 
proach (e.g., 0] and references therein). We will explore 
this issue in more detail in a future publication. 



V. COMPUTATIONAL CONSIDERATIONS 

The computationally most demanding part of imple- 
menting this method is solving Eq. i|l(J|) and Eq. I|ll|) at 
each iteration of the Gibbs sampler. Each of these is a 
linear system of equations of the form Mv = w, where 
M = (1 + Si N' 1 ) . It should be noted that these sys- 
tems are of the same size as the map-making equation, 
Eq. J2J. Maps also have to be made for approximate 
estimators. Therefore we expect the computational com- 
plexity of the Gibbs sampler to be no worse than the 
computational complexity of an approximate method. 

For large n p (n p > 10 5 ) on the largest supercomput- 
ers available at the time of writing), direct solution of 
either of these equations becomes infeasible, because nei- 
ther of them are sparse. This means the operation count 
scales as rip and because the memory requirements for 
storing the coefficient matrices scales as n: \. Therefore 



large systems of this type are usually solved using iter- 
ative techniques, such as the conjugate gradient (CG) 
technique |27| . The memory savings can be very large: 
the components of M do not have to be stored as long as 
matrix vector products Mv can be computed somehow. 
In terms of CPU time, iterative techniques outperform 
direct techniques if either Mv can be computed in less 
than rip operations or the number of iterations required 
to converge to a solution of sufficient accuracy is much 
less than n p . 

We chose to write Eq. ljTTj|l and Eq. (fTTfl in a form 
which satisfies all of theses requirements. The memory 
required is of order n p since we never need to store the 
components of the coefficient matrix. 

The action of any power of S on a vector can be com- 
puted in much less than rip operations using spherical 
harmonic transforms (or FFTs in the flat sky approxi- 
mation). The action of iV _1 = A T N to ^ l A on a vector is 
generally easier to compute than the action of N on a 
vector. As long as noise correlations can be modeled in 
a simple way in the time-domain (e.g. as piecewise sta- 
tionary) the time required for applying iV _1 to a vector 
is similar to that required for a forward simulation of the 
data. 

The number of CG iterations until convergence can be 
reduced far below the theoretical maximum n p if M is 
nearly proportional to the unit matrix. This goal can be 
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FIG. 4: Correlation matrix of Ce estimates from the COBE-DMR data. The diagonal components have been set to zero to 
enhance the contrast of the off-diagonal components. The surface is shaded according to height. We see that correlations between 
the power spectrum estimates vary between 8% correlation at (£,£') = (6, 10) and 15% anti-correlation at (£,£') = (8, 12). See 
Figure |K| 



approached by finding an approximate inverse for M , a 
preconditioner. 

If iV -1 were diagonal in the spherical harmonic basis, 
M would be, too. Therefore, as long as this is approx- 
imately true on scales where S 3> N , a good precondi- 
tioner for this system would be the inverse of the diagonal 
part of M in the spherical harmonic basis. These are easy 
to compute if we approximate the diagonal components 
of iV -1 by counting the number of TOD samples in each 
pixel and weighting by the current noise temperature of 
the detector. Due to the way Eq. (|10|l and Eq. Ijllfl have 
been written, the structure of N^ 1 in the noise domi- 
nated regime does not matter, since if S <C N, M w 1. 

3 

This preconditioner can be computed in O(rip) oper- 
ations. Figure n shows the results of a timing study for 
simulated data sets of varying size with WMAP-like scan- 
ning strategy and uncorrelated noise. The preconditioner 
performs well. The number of iterations does not increase 
with problem size over three orders of magnitude in n p 
and the computing time is is dominated by the spherical 
harmonic transforms. 



VI. APPLICATION TO THE COBE DATA 

In order to test our method we applied to the well- 
studied COBE-DMR data. The exact maximum likeli- 
hood estimator [(J [2^, and the least square quadratic 
estimator [j| have been computed for this data set. How- 
ever, even for this small data set, the marginalized prob- 



ability densities of each individual Ci, or the joint pos- 
terior density of pairs of Cg have not been computed be- 
cause doing so would require numerical integration over 
~ 20 dimensions. We will show these densities here for 
the first time. 

The COBE-DMR data 0] is published in the quad- 
cube data structure, at a resolution which has 6144 pix- 
els on the sphere. We use a noise-weighted average of 
the 53GHz and 90GHz maps. Because much of our code 
was already written for a HealPix data structure, we put 
the COBE data into a HealPix pixelization at resolution 
n S ide — 64 with 49152 pixels. HealPix pixels whose cen- 
ters lie within the same quadcube pixel get the same data 
(temperature) value. 

Because the noise is completely correlated between sets 
of HealPix pixels in the same quadcube pixel, the noise 
covariance matrix N is block diagonal, where each ele- 
ment of the block is a 2 , the published (noise) variance of 
that quadcube pixel. This means that N is not strictly 
invertible, so we have to use a pseudo-inverse for N^ 1 . 
Our pseudo-inverse is also block diagonal, with constant- 
valued blocks, and correctly inverts the action of N on a 
vector that is constant valued on the same blocks as N. 

We project out the mean and dipoles from the un- 
cut region of the COBE-DMR map, and model the data 
within the custom galactic cut as Gaussian random white 
noise with large variance. This corresponds to claim- 
ing complete ignorance of the foregrounds at low galactic 
latitudes (within the custom cut) and assuming that no 
residual foregrounds are present at high latitudes (out- 
side the cut region). This is the simplest possible way of 
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treating the monopole, dipole, and galactic foregrounds. 

Our noise matrix has values published by the COBE 
team, but with the a 1 noise variance increased to 
1000 mK 2 in the galactic cut region, a numerically large 
value that exceeds any other variance in the problem. 

For the first iteration of the Gibbs sampler we choose 

1(1-4 

Co = d = HT 30 mK 2 Cg = mK 2 . (24) 

We chose these values because they very roughly approx- 
imate the true Cg values to reduce burn-in time. The 
first two are numerically small, because we consider the 
monopole and dipole to be non-cosmological. During the 
Cg estimation step of the Gibbs sampler, the Co and C\ 
values are not changed. This corresponds to enforcing 
the prior that the cosmological signal does not contain 
such components. 

The Gibbs sampler is run through 10, 000 iterations 
(sets of Cg values). This takes approximately 24 hours 
on an Athlon XP1800+ workstation. We ignore the 
first 1000 iterations to ensure that the Gibbs sampler 



has converged to the true distribution. This is very 
conservative — in fact by computing correlations of our 
Ci draws along the chain we infer that about every 20 th 
sample is uncorrelated. 

We plot the power spectrum in figure |2 For each I 
value, we display vertically a binned representation of 
the marginalized posterior densities P(Ci\m). The bins 
all hold an equal number of points. The bins that are 
thinnest (points are densest in Cg space) are colored more 
darkly. The top 68% are dark gray; from 68% to 95% are 
lighter gray, and the rest are white. The highest density 
bin is shown in black. 

To explore the marginalized posterior Cg distribution 
in more detail we plot their histograms in Figure [3J It is 
noteworthy that not a single one of these is even nearly 
Gaussian. Within the context of the discussion of the lack 
of large scale power in the CMB, it is worth pointing out 
that all inferences about C2 from COBE-DMR can be 
based on the P(C-i\d) shown here. 

The correlation structure of the estimates contains in- 
formation about how well we were able to account for the 
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FIG. 6: Reconstructed signal maps in Galactic coordinates. 
A: The signal component of the COBE-DMR data marginal- 
ized over the power spectrum: (x)\p( x \ m ). This is a general- 
ized Wiener filter which does not require knowing the signal 
covariance a priori. B: The solution y of Eq. li lt at one Gibbs 
iteration. C: The sample pure signal sky s = x + y at the 
same iteration (band-limited at £ ma x = 50). D: The WMAP 
internal linear combination map smoothed to an FWHM of 
5 degrees. The corresponding features in parts A and D are 
clearly visible. Note that in this map low galactic latitudes 
are not masked, which leads to some artifacts that are not 
visible in the masked COBE-DMR data. 



effects of the galactic cut. It is clear from figure 0] that 
the residual correlations are at most of order 10% even 
at very small £. 

However, since the posterior densities are non- 
Gaussian, the two-point correlations do not contain all 
the information. We therefore show the marginalized 
posteriors for four pairs of Cts in figure 03 Again, all 
four of these densities are strongly non-Gaussian. 

Lastly, we show the reconstructed signals. Figure 
shows the expectation of the signal component (the solu- 
tion of Eq. I|10|) at each iteration of the Gibbs sampler) of 
the COBE-DMR data with respect to the posterior den- 
sity marginalized over the power spectrum: (a;)|p( x | m )- 
This is a generalized Wiener filter (GWF) which does 
not require knowing the signal covariance a priori. The 
smoothing of the map autmatically adapts locally de- 
pending on how much detail the data support. The more 
strongly smoothed central horizontal band was obscured 
by the galaxy. Still the GWF reconstructs large scale 
modes in the galactic cut. 

The power spectrum of figure|HJV would be biased low, 
since the Wiener filter removes everything that could be 
noise. At each iteration of the Gibbs sampler the solution 
to Eq. (|11|) (shown in figure EJ3) adds in a fluctuating 
term that replaces filtered noise with synthetic signal. It 
is noticable that this fill-in signal is larger in the regions 
of the map where the Galaxy obscures the CMB. The 
resulting draw s from Eq. JfjJ) is shown in figure|Hp. Every 
s draw is one possible pure signal sky that could have 
given rise to the data. Since we know that the COBE 
data has no statistical power above an £ max of about 20, 
we imposed a bandlimit of i max = 50. 

For comparison with the inferences we draw from the 
COBE-DMR data, we show in figure HJ) the internal 
linear combination map from the WMAP satellite Q 
smoothed down to five degrees FWHM, an intermedi- 
ate scale between the slightly larger average smoothing 
of panel A and the somewhat smaller smoothing of panel 
B and C due to the bandlimit of£ max = 50. Nearly every 
hot and cold spot that is identified by the GWF can be 
found in the high signal-to-noise WMAP data. Figure 
fills in signal very plausibly up to the imposed ban- 
dlimit. Even more striking is the similarity of our figure 
EJ^. to the combination of Q and V band WMAP data 
shown in figure 8 of Q, which is intended to mimic the 
COBE-DMR 53GHz map. 



VII. CONCLUSIONS AND FUTURE WORK 

We have described a framework for global and loss- 
less analysis of cosmic microwave background data. This 
framework is based on a Bayesian analysis of CMB data. 
It has several advantages compared to traditional meth- 
ods. It is computationally feasible. It is optimal and 
exact under the assumption of Gaussian fields and the 
ability to encode our prior knowledge about foreground 
components in terms of multivariate Gaussian densities. 
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It uses controlled approximations (e.g. the number of 
samples of the Gibbs sampler controls the accuracy of 
the result but this can be increased by spending more 
computing time). It allows joint analysis of the CMB sig- 
nal, foregrounds and noise properties of the instrument, 
while modeling and exploiting the statistical dependence 
between these different inferences. 

Traditional methods of inference from CMB data di- 
vide the data analysis into several steps: map-making 
from TOD, component separation, power spectrum es- 
timation from the CMB signal and cosmological param- 
eter estimation. Our method allows treating all these 
inferences jointly and self-consistently, if desired. The 
traditional results can be understood as special cases of 
our method for certain uninformative prior choices. For 
example, pure map-making could be viewed as applying 
this framework with P(s\Ci)P(f)P(Ci) — const. 

In spite of this generality, the framework for analyz- 
ing CMB data described here is very modular: the struc- 
ture of the Gibbs sampling scheme separates the different 
steps of the inference process focusing on each component 
in turn. The framework described here therefore holds 
the promise of making more data analysis steps part of 
a self-consistent framework rather than sequential stages 
in a data pipeline. 

Our method turns out to give an unbiased Wiener fil- 
ter and generalizes the global filtering and reconstruction 
methods in |25| to include power spectrum estimation, 
obviating the need for a priori knowledge of the signal 
covariance. 

We require the use of iterative techniques to solve the 
most computationally demanding step in this method. 
We find that our simple-minded preconditioned gradi- 
ent iteration works well over 3 orders of magnitude in 
problem size. It remains to be studied whether other 
preconditioners may be even more effective (e.g. [H|). 

We applied our formalism to the well-studied COBE- 
DMR data set. We demonstrate that our methods enable 
new analyses for even such a small data set. We quote 
posterior densities for individual CV as well as posterior 
densities for pairs of Ci as examples of results that would 
be prohibitively expensive to obtain with traditional al- 
gorithms. Our results are consistent with the most so- 
phisticated brute force 0(n,p) analyses available in the 
literature. 

The approach can be extended straightforwardly to po- 
larized maps, data that spans different frequency bands 
and joint estimation of different data sets. There is noth- 
ing that prevents the application of these ideas to random 
fields on manifolds other than the sphere, such as one-, 
two- or three-dimensional Euclidean space. We are inves- 
tigating the formalism for joint inference from CMB data 
about the power spectrum and map of the pure CMB sky 
with the power spectrum and map of the projected gravi- 
tational potential. We will report on these developments 
in a future publication. 
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APPENDIX A: ANALYTIC MARGINALIZATION 
OVER FOREGROUNDS 

From a statistical point of view we consider the a pos- 
teriori distribution to be a function of the CMB signal, 
Ci and the foregrounds. Then we marginalize over the 
foregrounds /. This can be done either implicitly through 
Gibbs sampling from the joint posterior density including 
/ (as described in the main text) and then marginalizing 
over / in the generated samples or explicitly through an- 
alytic marginalization of the posterior over /. Then the 
ignorance about the foreground is included in additional 
terms in the noise covariance matrix. The first route is 
more general, but there may be occasions where the sec- 
ond is preferable; for example if the main goal is to make 
the CMB analysis insensitive to a small number of known 
foreground templates /j. 

The effect of analytic marginalization is that the new 
noise covariance matrix TV' including the foreground term 
becomes 

N' = N + a}FF T = N + a}Y J fifi- ( A1 ) 

i 

In order to implement the Gibbs sampler including this 
new term we need to be able to apply N' 1 to vectors. 
If only a small number (up to ~ 1 , 000) of foreground 
templates need to be projected out this can best be done 
by grouping all the vectors using the following limit of 
the Sherman-Morrison- Woodbury formula j25j 




TV" 1 - [N _1 F (F T N~ 1 F)~ 1 F T N~ 1 )] . 

This operation will project out the directions in iV _1 
corresponding to the foreground contributions. The ac- 
tion of this new inverse noise covariance matrix on a vec- 
tor can be computed using methods similar to those de- 
scribed in Eqs. (2.7.16$) in p7|. 

Alternatively one can solve iteratively the set of equa- 
tions 

(N + o-)FF T ) x = v (A3) 

every time x = N'^v is required on the LHS of Eq. i|10|) 
and Eq. ifTTj). When N'~i is required for the RHS of Eq. 
(|llfl we solve 

(N + cr 2 f FF T )y = N^ + o-fFX- (A4) 
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The term Ni£ is obtained by simulating a noise-only However, the method in the main text is more flexible, 
map solving Eq. (J2J with d — n to d- In both of these since it allows grouping different foregrounds together in 
equations one can choose at numerically large. ways that are computationally convenient. 
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